Abstract. Phylogenetic mixture models are statistical models of character evolution
allowing for heterogeneity. Each of the classes in some unknown partition of the char- 
acters may evolve by different processes, or even along different trees. The fundamental 
question of whether parameters of such a model are identifiable is difficult to address, 
due to the complexity of the parameterization. We analyze mixture models on large 
trees, with many mixture components, showing that both numerical and tree parame- 
ters are indeed identifiable in these models when all trees are the same. We also explore 
the extent to which our algebraic techniques can be employed to extend the result to 
mixtures on different trees. 

1. Introduction 

A fundamental question about any parametric statistical model is whether or not the 
parameters of that model are identifiable; that is, does a probability distribution arising 
from the model uniquely determine the parameters that produced it. Establishing the 
identifiability of parameters is important for statistical inference, especially in models 
where the parameters have a physical or biological interpretation. For example, it is well- 
known that identifiability is a necessary condition for statistical consistency of maximum 
likelihood estimation [HI Chapter 16]. 

In phylogenetics, parameters of interest include the discrete tree parameter and nu- 
merical parameters specifying substitution processes on the edges of the tree. For the 
simplest phylogenetic models, identifiability of both tree and numerical parameters has
long been established [11]. But as models grow in complexity, with both the combinatorial 
description of trees and the underlying number of numerical parameters increasing, the 
question of identifiability is far from settled. 

A particular class of complex phylogenetic models of growing interest and use is that of
phylogenetic mixture models. Relatively simple examples are the models with small numbers
of parameters (including those with Γ-distributed rates, invariable sites, and combinations
of these) that are currently the most commonly used in data analysis. More
elaborate mixtures allow across-site rate variation with more freedom in the distribution 
of the rate multipliers [15], the use of different rate matrices [18], or even multiple distinct
trees each with their own rate and time parameters. Such models may have a large num- 
ber of mixture components. For instance, a Bayesian nonparametric analysis conducted 
in [15] allowed a variable number of components, with a Dirichlet process prior specifying
a mean of as many as 20. 

However, only the simplest phylogenetic mixture models have been proven to be identifiable,
typically where the number of parameters is small. The papers [U IH El El QUI EE]
contain previous results on identifiability of such models, of various sorts. Note that only
recently has it been shown that most choices of parameters of the widely-used GTR + I
+ Γ model are identifiable [TP], although for a certain type of rate matrix the question
remains open.

Our goal in this paper is to develop methods to prove identifiability in phylogenetic 
models that are considerably more complex than in previous work. In particular, we 
investigate the identifiability of phylogenetic models with many mixing components. A 
consequence of our methods is the following theorem: 

Theorem 1.1. For an $r$-component identical tree mixture of the general Markov model
of character evolution with $\kappa$-state random variables on an $n$-leaf binary phylogenetic
tree, both the tree parameter and the numerical parameters are generically identifiable if
$r < \kappa^{\lceil n/4 \rceil - 1}$.

By an identical tree mixture model we mean a mixture of probability distributions com- 
ing from the same topological phylogenetic tree. More complicated mixture models might 
have each distribution arising from a different topological tree. Theorem 1.1 improves
substantially over past identifiability results on identical tree phylogenetic mixture models.
Previously, it was only known that the tree parameter is identifiable, and then only
in the case that $r < \kappa$ (work of Allman and the first author [6]). This new theorem quantifies
the intuition that larger taxon sets should allow for identifiability of more complex
models, and is an exponential improvement over previous results. 
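
To give a feel for the size of this improvement, the following minimal sketch tabulates both bounds for DNA data. It assumes the bound of Theorem 1.1 reads $r < \kappa^{\lceil n/4 \rceil - 1}$ and the earlier bound reads $r < \kappa$; the function names are hypothetical and chosen only for illustration.

```python
def max_components_new(kappa: int, n: int) -> int:
    """Largest integer r with r < kappa ** (ceil(n / 4) - 1)."""
    ceil_quarter = -(-n // 4)  # integer ceiling of n / 4
    return kappa ** (ceil_quarter - 1) - 1


def max_components_old(kappa: int) -> int:
    """Largest integer r with r < kappa (tree identifiability only)."""
    return kappa - 1


# Compare the two bounds for 4-state (DNA) models on trees of growing size.
for n in (8, 12, 20, 40):
    print(n, max_components_old(4), max_components_new(4, n))
```

The older bound is independent of the number of taxa, while the newer one grows exponentially in $n$.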

Our strategy of proof is to combine two techniques coming from the algebraic study 
of phylogenetic models. First, we use the representation of probability distributions in a 
phylogenetic model as tensors with small tensor rank and employ a theorem of J. Kruskal 
to uniquely identify components of that tensor. Second, we use phylogenetic invariants 
as tools to identify deeply embedded features of phylogenetic trees, and to "untangle" 
probability distributions that have been shuffled together by the tensor analysis. While 
each technique by itself is only able to make a small advance on the identifiability problem, 
when combined they give dramatically stronger results. Background on these general 
techniques appears in Section 3, and the proofs of the main theorems are in Section 4.

Our techniques actually extend to mixtures from different trees provided they all share 
a certain type of common substructure. It is in this generality that we prove our main 
results, Theorems 4.6 and 4.7, with Theorem 1.1 arising as a corollary.

The assumption of any common substructure in the trees is of course false in some 
biological situations modeled by mixtures. For instance, if the mixture is due to the 
coalescent process modeling incomplete lineage sorting on a species tree of populations, 
then components will be present from all topological gene trees [13, 22]. However, one
might also model lateral gene transfer at a number of (unknown) locations in a tree as a 
mixture, and for this the assumption of common substructure could be quite plausible. 

2. Preliminaries 



2.1. Mixture models. Consider the general Markov model of $\kappa$-state character evolution,
$\mathrm{GM}(\kappa)$, on $n$-taxon trees (e.g., $\kappa = 4$ corresponding to DNA sequences). We assume
the taxa labeling the leaves are identified with $[n] = \{1, 2, \ldots, n\}$. Then for each rooted
leaf-labeled tree $T$, there is a parametrization map $\psi_T$ giving the joint distribution of
states at the leaves of the tree $T$ as functions of continuous parameters, which specify the
state distribution at the root and the transition probabilities on the edges. Let $S_T$ denote
the continuous parameter space of $\mathrm{GM}(\kappa)$ on $T$, which is a full dimensional subset of
some $\mathbb{R}^m$. Then

$$\psi_T : S_T \to \Delta^{\kappa^n - 1},$$

where $\Delta^{\kappa^n - 1} \subset [0, 1]^{\kappa^n}$ denotes the probability simplex comprised of non-negative real
vectors summing to 1. The image of this map is the phylogenetic model $\mathcal{M}_T \subseteq \Delta^{\kappa^n - 1}$.

The associated $r$-component mixture model has the following parametrization: For
every $r$-tuple of trees $\mathbf{T} = (T_1, T_2, \ldots, T_r)$ on the same taxa $[n]$, let
$S_{\mathbf{T}} = S_{T_1} \times \cdots \times S_{T_r} \times \Delta^{r-1}$ and let

$$\psi_{\mathbf{T}} : S_{\mathbf{T}} \to \Delta^{\kappa^n - 1}$$

be defined by

$$\psi_{\mathbf{T}}(s_1, \ldots, s_r, \pi) = \pi_1 \psi_{T_1}(s_1) + \cdots + \pi_r \psi_{T_r}(s_r).$$

Thus $\pi$ is the vector of mixing parameters; each $\pi_i$ gives the proportion of i.i.d. sites that
evolve along tree $T_i$ with parameter vector $s_i$. The $r$-component mixture model on $\mathbf{T}$ is
the image of the map $\psi_{\mathbf{T}}$, and is denoted

$$\mathcal{M}_{\mathbf{T}} = \mathcal{M}_{T_1} * \mathcal{M}_{T_2} * \cdots * \mathcal{M}_{T_r}.$$

Clearly $\mathcal{M}_{\mathbf{T}}$ depends only on the unordered multiset of the trees in $\mathbf{T}$. In the case where
$T_i = T$ for all $i$, we call this an $r$-component identical tree mixture model on $T$.
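
The parametrization above can be sketched numerically. The block below is only an illustration under simplifying assumptions: it takes $\kappa = 2$, uses a 3-leaf star tree rooted at its central node for every component, and picks arbitrary parameter values; the helper names are hypothetical. Exact rational arithmetic is used so that the result can be checked without rounding.

```python
from fractions import Fraction as F
from itertools import product


def star_tree_distribution(root, leaf_matrices):
    """Joint leaf distribution of GM(kappa) on a star tree rooted at its center.

    root: list of kappa root-state probabilities.
    leaf_matrices: one kappa x kappa Markov matrix (rows sum to 1) per leaf.
    """
    kappa = len(root)
    dist = {}
    for states in product(range(kappa), repeat=len(leaf_matrices)):
        total = F(0)
        for a in range(kappa):  # sum over the hidden root state
            term = root[a]
            for M, x in zip(leaf_matrices, states):
                term *= M[a][x]  # P(leaf state x | root state a)
            total += term
        dist[states] = total
    return dist


def mixture(weights, components):
    """Convex combination pi_1 * P_1 + ... + pi_r * P_r of leaf distributions."""
    return {key: sum(w * P[key] for w, P in zip(weights, components))
            for key in components[0]}


def markov(p, q):
    """2-state Markov matrix with off-diagonal entries p and q."""
    return [[1 - p, p], [q, 1 - q]]


# Two components on the same 3-leaf star tree (illustrative parameter values).
P1 = star_tree_distribution([F(1, 2), F(1, 2)], [markov(F(1, 10), F(1, 5))] * 3)
P2 = star_tree_distribution([F(1, 4), F(3, 4)], [markov(F(2, 5), F(3, 10))] * 3)
P = mixture([F(2, 3), F(1, 3)], [P1, P2])
print(sum(P.values()))  # total probability of the mixture
```

Each entry of `P` is a polynomial in the parameters, which is the sense in which these models are algebraic.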

We focus on the mixture models built from the basic model $\mathrm{GM}(\kappa)$ in this paper, as
these are quite general algebraic models, for which the maps $\psi_T$ are naturally defined
by polynomial formulas. Many models which are not polynomial (in particular, those 
built from the general time-reversible model) can be embedded in them. The polynomial 
structure of algebraic models allows them to be studied using techniques from algebraic 
geometry. 

2.2. Identifiability of parameters. For algebraic models, it is convenient to slightly 
weaken the notion of identifiability of parameters to generic identifiability. The word 
"generic" is used to mean "except on a proper algebraic subvariety" of the parameter 
space. (See Section 3.3 for a formal definition of variety.) Although it is sometimes
possible to be explicit about this subvariety, we usually are not, since the key point in
interpretation is that a proper subvariety is a closed set of Lebesgue measure zero inside the
larger set. Thus regardless of the precise subvariety involved, randomly chosen points are 
generic with probability 1. 

For an unmixed $\mathrm{GM}(\kappa)$ model on a single tree $T$, there are several well-understood issues
with identifiability of parameters. First, at any internal node of the tree, in a phenomenon 
called label swapping, one may permute the names of the state space of the corresponding 
hidden variable (permuting the columns or rows of the Markov matrices on edges leading 
to or from the node) with no effect on the probability distribution. Second, while the 
standard parameterization of the $\mathrm{GM}(\kappa)$ model on a tree $T$ requires specification of the
root of $T$, for generic choices of parameters one can relocate the root (with an appropriate
uniquely determined change to the parameters, up to label swapping) with no effect on 
the probability distribution. Third, if any internal nodes of T have degree 2, they may 
be suppressed and the Markov matrices on incident edges combined, with no effect on 
the probability distribution. Thus one generally assumes trees have no such nodes. For 
simplicity, we do not always explicitly refer to these issues in our formal statements in this 
article. However, we will occasionally use the second fact to choose a convenient location 
for a root of a tree in our arguments. 

That these are the only issues for parameter identifiability for the unmixed model is 
the content of the following theorem, which was essentially shown in [11].

Theorem 2.1. For the $\mathrm{GM}(\kappa)$ model on a single tree,

(1) The unrooted tree parameter is generically identifiable, in the class of binary trees.

(2) For a fixed binary tree $T$, the numerical parameters of the $\mathrm{GM}(\kappa)$ model on $T$ are
generically identifiable, up to label swapping at internal nodes of the tree, and an
arbitrary choice of a node as the root.

An additional issue for identifiability of r-tree mixtures is component swapping: Inter- 
changing the trees along with their parameters, while permuting the mixing parameters in 
the same way, has no effect on the resulting distribution. A useful notion of identifiability 
must allow for this. 
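
Both label swapping and component swapping can be checked directly on a small example. The sketch below is an illustration only, assuming $\kappa = 2$ and 3-leaf star trees rooted at the central node, with arbitrary parameter values and hypothetical helper names.

```python
from fractions import Fraction as F
from itertools import product


def star_tree_distribution(root, leaf_matrices):
    """Joint leaf distribution of GM(kappa) on a star tree rooted at its center."""
    kappa = len(root)
    return {
        states: sum(
            root[a] * F(1) * _row_product(leaf_matrices, a, states)
            for a in range(kappa)
        )
        for states in product(range(kappa), repeat=len(leaf_matrices))
    }


def _row_product(leaf_matrices, a, states):
    """Product over leaves of P(leaf state | root state a)."""
    result = F(1)
    for M, x in zip(leaf_matrices, states):
        result *= M[a][x]
    return result


def swap_labels(root, leaf_matrices, perm):
    """Rename hidden root states: new state i plays the role of old state perm[i]."""
    new_root = [root[p] for p in perm]
    new_matrices = [[M[p] for p in perm] for M in leaf_matrices]
    return new_root, new_matrices


def mixture(weights, components):
    return {key: sum(w * P[key] for w, P in zip(weights, components))
            for key in components[0]}


root = [F(1, 3), F(2, 3)]
mats = [
    [[F(9, 10), F(1, 10)], [F(1, 5), F(4, 5)]],
    [[F(3, 4), F(1, 4)], [F(1, 2), F(1, 2)]],
    [[F(2, 3), F(1, 3)], [F(1, 10), F(9, 10)]],
]

# Label swapping: permute the root distribution and the rows of each matrix.
P = star_tree_distribution(root, mats)
P_swapped = star_tree_distribution(*swap_labels(root, mats, [1, 0]))
print(P == P_swapped)

# Component swapping: reorder components together with their mixing weights.
Q = star_tree_distribution([F(1, 2), F(1, 2)], mats[::-1])
print(mixture([F(1, 4), F(3, 4)], [P, Q]) == mixture([F(3, 4), F(1, 4)], [Q, P]))
```

In both cases the parameters differ but the leaf distribution is identical, so identifiability can hold only up to these symmetries.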

Definition 2.2. The tree parameters of the $r$-tree mixture are generically identifiable
if for any binary trees $\mathbf{T} = (T_1, \ldots, T_r)$ on the same set of taxa, and generic choices of
parameters $s_1, \ldots, s_r, \pi$,

$$\psi_{\mathbf{T}}(s_1, \ldots, s_r, \pi) = \psi_{\mathbf{T}'}(s'_1, \ldots, s'_r, \pi')$$

implies that $\mathbf{T} = \sigma \cdot \mathbf{T}'$ for some $\sigma \in \mathfrak{S}_r$, the symmetric group of permutations.

We also investigate identifiability of tree parameters when restricting to specific classes
of $r$-tuples of trees. For example, Theorem 1.1 concerns identifiability of tree parameters
among all sets $\mathbf{T} = \{T_1, \ldots, T_r\}$, where $T_1 = \cdots = T_r$. Our main results, Theorems 4.6
and 4.7, concern identifiability in the class of $r$-tuples of trees that all contain a specified
deep common substructure, whose precise definition will be given in Section 4.

Definition 2.3. The continuous parameters of an $r$-tree mixture on $\mathbf{T}$ are generically
identifiable if for generic choices of $s_1, \ldots, s_r$ and $\pi$,

$$\psi_{\mathbf{T}}(s_1, \ldots, s_r, \pi) = \psi_{\mathbf{T}}(s'_1, \ldots, s'_r, \pi')$$

implies that there is a permutation $\sigma \in \mathfrak{S}_r$ such that $\sigma \cdot \mathbf{T} = \mathbf{T}$, $s'_i = s_{\sigma(i)}$, and
$\pi'_i = \pi_{\sigma(i)}$ for $i = 1, \ldots, r$.

Note this definition only allows the swapping of continuous parameters $s_i, \pi_i$ with $s_j, \pi_j$
when $T_i = T_j$.

2.3. Splits and tripartitions. We will use the combinatorial notion of a split of the
leaves of a tree associated to an edge in a binary tree, as well as the analog of this concept 
for a node of the tree. 

Definition 2.4. A split of $[n]$ is a bipartition $A|B$ of $[n]$ with two nonempty elements. A
split is said to be compatible with a tree T if it arises as the partition of leaves induced 
by an edge in some binary resolution of T. 

Similarly, a tripartition $A|B|C$ of the leaves is said to be compatible with $T$ if it arises
as the tripartition induced by an interior vertex in some binary resolution of T. 

A collection of trees is said to have a common split (or tripartition) if the split (or 
tripartition) is compatible with every tree in the collection. 

A collection of trees has a common tripartition $A|B|C$ if, and only if, it also has the three
common splits $A|B \cup C$, $B|A \cup C$, and $C|A \cup B$. For a binary tree, these are the splits
associated to the edges radiating from the vertex inducing the tripartition. Note also that 
our definition of compatible splits differs from the standard definition (e.g., in [19]) in 
the case of trees with polytomies. Our notion is more useful when studying geometric 
properties of phylogenetic models. 

3. Tensors and Invariants 

The two main tools we use to prove our results are Kruskal's theorem on uniqueness 
of tensor decompositions and phylogenetic invariants. In this section, we describe these 
tools. Both are connected to the notion of a flattening of the probability distribution 
arising from a phylogenetic model. 

3.1. Tensors and Unique Decomposition. By a tensor, we mean simply an n-way 
rectangular array of numbers. A 2-way tensor is thus a matrix. 

For $j = 1, 2, 3$, let $M_j$ be an $r \times \kappa_j$ matrix with $i$th row $m_i^j = (m_i^j(1), \ldots, m_i^j(\kappa_j))$. Let
$[M_1, M_2, M_3]$ denote the 3-way $\kappa_1 \times \kappa_2 \times \kappa_3$ tensor defined by

$$[M_1, M_2, M_3] = \sum_{i=1}^{r} m_i^1 \otimes m_i^2 \otimes m_i^3.$$

In other words, $[M_1, M_2, M_3]$ is a $\kappa_1 \times \kappa_2 \times \kappa_3$ array whose $(u, v, w)$ entry is

$$[M_1, M_2, M_3]_{u,v,w} = \sum_{i=1}^{r} m_i^1(u)\, m_i^2(v)\, m_i^3(w).$$

Every 3-way tensor can be expressed in this way, for sufficiently large r. A nonzero tensor 
of this form with r = 1 is said to have tensor rank 1. More generally, the minimal r such 
that a 3-way tensor can be decomposed as such a sum is called its tensor rank. A natural 
question is when this expression is essentially unique. 
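
As a concrete illustration of the entrywise formula, the sketch below builds the tensor $[M_1, M_2, M_3]$ from three small integer matrices. The function name `triple_product` and the example matrices are hypothetical, chosen only to make the construction explicit.

```python
from itertools import product


def triple_product(M1, M2, M3):
    """Return [M1, M2, M3] as a dict mapping (u, v, w) to
    sum_i M1[i][u] * M2[i][v] * M3[i][w], where i runs over the r rows."""
    r = len(M1)
    k1, k2, k3 = len(M1[0]), len(M2[0]), len(M3[0])
    return {
        (u, v, w): sum(M1[i][u] * M2[i][v] * M3[i][w] for i in range(r))
        for u, v, w in product(range(k1), range(k2), range(k3))
    }


# Example with r = 2 rows and 2 x 2 x 2 output (arbitrary values).
M1 = [[1, 2], [3, 4]]
M2 = [[1, 0], [0, 1]]
M3 = [[1, 1], [2, 0]]
T = triple_product(M1, M2, M3)
print(T[(0, 0, 0)], T[(1, 1, 0)], T[(0, 1, 1)])
```

Because the sum runs over all rows, reordering the rows of the three matrices in the same way leaves the dictionary unchanged.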

Note there are two basic operations on the matrices $M_1, M_2, M_3$ which leave unchanged
the tensor $[M_1, M_2, M_3]$: one can simultaneously permute the rows of the three matrices
$M_1$, $M_2$, and $M_3$, or, taking three numbers $a_1, a_2, a_3$ such that $a_1 a_2 a_3 = 1$, one can replace
the $i$th rows $m_i^j$ by $a_j m_i^j$. Kruskal's Theorem [16, 17] describes a situation where these
operations lead to the only variants in a tensor decomposition.

Given an $r \times \kappa$ matrix $M$, its Kruskal rank, denoted $\operatorname{rank}_K(M)$, is the largest value $k$ such
that every subset of $k$ rows of $M$ is linearly independent. Note that $\operatorname{rank}_K(M) \le \operatorname{rank}(M)$.

Theorem 3.1 ([16, 17]). Let $I_j = \operatorname{rank}_K(M_j)$, where $M_j$ is $r \times \kappa_j$. If

$$I_1 + I_2 + I_3 \ge 2r + 2,$$

then $[M_1, M_2, M_3]$ uniquely determines $M_1, M_2, M_3$ up to simultaneous permutation and
scaling of the rows.
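
The Kruskal rank and the hypothesis of Theorem 3.1 can be checked by brute force on small matrices. The following is a minimal sketch rather than an efficient algorithm: it enumerates all row subsets, which is exponential in $r$, and the function names are hypothetical. Rank is computed by exact Gaussian elimination over the rationals.

```python
from fractions import Fraction
from itertools import combinations


def rank(rows):
    """Rank of a matrix (list of rows) via Gaussian elimination over the rationals."""
    A = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    ncols = len(A[0]) if A else 0
    for col in range(ncols):
        pivot = next((i for i in range(rk, len(A)) if A[i][col] != 0), None)
        if pivot is None:
            continue
        A[rk], A[pivot] = A[pivot], A[rk]
        for i in range(rk + 1, len(A)):
            factor = A[i][col] / A[rk][col]
            A[i] = [x - factor * y for x, y in zip(A[i], A[rk])]
        rk += 1
    return rk


def kruskal_rank(M):
    """Largest k such that every set of k rows of M is linearly independent."""
    best = 0
    for k in range(1, len(M) + 1):
        if all(rank([M[i] for i in subset]) == k
               for subset in combinations(range(len(M)), k)):
            best = k
        else:
            break
    return best


def kruskal_condition(M1, M2, M3):
    """Check I_1 + I_2 + I_3 >= 2r + 2 for three matrices with r rows each."""
    r = len(M1)
    return kruskal_rank(M1) + kruskal_rank(M2) + kruskal_rank(M3) >= 2 * r + 2


identity3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
thin = [[1, 0], [0, 1], [1, 1]]        # any two rows independent, three are not
repeated = [[1, 0], [2, 0], [0, 1]]    # rows 0 and 1 are parallel

print(kruskal_rank(thin), rank(thin))
print(kruskal_rank(repeated), rank(repeated))
print(kruskal_condition(identity3, identity3, identity3))
print(kruskal_condition(thin, thin, thin))
```

The matrix `repeated` shows how the Kruskal rank can be strictly smaller than the ordinary rank.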

Kruskal's theorem has proven useful for proving identifiability results of numerical parameters
both for phylogenetic models [9] and for other statistical models with hidden
variables [2, 3]. We will show how to combine this with other algebraic techniques to also
deduce identifiability of tree parameters. 

3.2. Flattenings. While Kruskal's theorem concerns 3-way tensors, the tensors arising
in phylogenetics are usually $n$-way $\kappa \times \cdots \times \kappa$ tensors, corresponding to the $n$ leaves of
a phylogenetic tree. We will make frequent use of flattenings of $n$-way tensors to lower
order tensors. A flattening of an $n$-way tensor is simply a reorganization of that tensor as
a $k$-way tensor, with $k < n$, of larger dimensions. We take a $\kappa_1 \times \cdots \times \kappa_n$ tensor $M$, with
typical entry $M(u_1, \ldots, u_n)$, and a partition $A_1|A_2|\cdots|A_k$ of $[n]$, and we represent this
tensor as a

$$\prod_{a \in A_1} \kappa_a \times \cdots \times \prod_{a \in A_k} \kappa_a$$

tensor $\tilde{M}$. The $(u_1, \ldots, u_n)$ entry of $M$ becomes the $((u_a)_{a \in A_1}, \ldots, (u_a)_{a \in A_k})$ entry of $\tilde{M}$.
That is, the indices for the new tensor $\tilde{M}$ are vectors of indices from the tensor $M$.

Given a partition $A_1|A_2|\cdots|A_k$ of $[n]$, we denote the corresponding flattening of $M$ by
$\operatorname{Flat}_{A_1|A_2|\cdots|A_k}(M)$.
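
The reindexing in a flattening can be sketched as follows. This is an illustration only: tensors are stored as dictionaries keyed by index tuples, leaf positions are 0-based, and the function names `flatten` and `to_matrix` are hypothetical.

```python
from itertools import product


def flatten(tensor, dims, partition):
    """Reindex an n-way tensor by a partition of its (0-based) positions.

    tensor: dict mapping (u_1, ..., u_n) to entries.
    dims: list of the n dimensions.
    partition: list of blocks, each a list of positions.
    Returns a dict keyed by tuples of index vectors, one vector per block.
    """
    flat = {}
    for idx in product(*(range(d) for d in dims)):
        key = tuple(tuple(idx[a] for a in block) for block in partition)
        flat[key] = tensor[idx]
    return flat


def to_matrix(flat, dims, partition):
    """Lay out a 2-block flattening as a list of rows (lexicographic order)."""
    row_block, col_block = partition
    row_keys = list(product(*(range(dims[a]) for a in row_block)))
    col_keys = list(product(*(range(dims[a]) for a in col_block)))
    return [[flat[(r, c)] for c in col_keys] for r in row_keys]


# A 2 x 2 x 2 x 2 tensor whose entry encodes its own index in binary.
dims = [2, 2, 2, 2]
M = {idx: 8 * idx[0] + 4 * idx[1] + 2 * idx[2] + idx[3]
     for idx in product(range(2), repeat=4)}

# Flatten along the split {1, 3} | {2, 4} (positions [0, 2] and [1, 3]).
partition = [[0, 2], [1, 3]]
flat = flatten(M, dims, partition)
for row in to_matrix(flat, dims, partition):
    print(row)
```

No entries are created or destroyed; the flattening only regroups the indices, so different partitions give matrices with the same entries in different positions.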

3.3. Invariants, Phylogenetic and Otherwise. We begin with a little background on 
algebraic geometry (see [12] for more detail). Let $\mathbb{R}[p_1, \ldots, p_m]$ be the set of all polynomials
in the variables (or indeterminates) $p_1, p_2, \ldots, p_m$, with coefficients in the real numbers,
$\mathbb{R}$. Algebraic geometry studies the zero sets of collections of polynomials. That is, to a
collection of polynomials $f_1, f_2, \ldots, f_k \in \mathbb{R}[p_1, \ldots, p_m]$ we associate the variety

$$V(f_1, \ldots, f_k) = \{a \in \mathbb{R}^m : f_1(a) = f_2(a) = \cdots = f_k(a) = 0\}.$$

The fact that these geometric sets arise from polynomials vanishing implies they have 
important structural features. 

Varieties arise in studying statistical models through describing models implicitly, 
rather than parametrically. For a fixed statistical model $\mathcal{M} \subseteq \Delta^{m-1}$, an invariant of
$\mathcal{M}$ is a polynomial $f \in \mathbb{R}[p_1, \ldots, p_m]$ such that $f(a) = 0$ for all $a \in \mathcal{M}$. In the case where
$\mathcal{M}$ is a phylogenetic model, such a polynomial is called a phylogenetic invariant.

Our main use in this paper for phylogenetic invariants is their connection to generic 
identifiability, through the following basic proposition from algebraic geometry. 

Proposition 3.2. Let $V_0$ and $V_1$ be two irreducible algebraic varieties, such as those
arising from parameterized statistical models. Suppose $f_0$ is an invariant for $V_0$, and
there exists a point $p_1 \in V_1$ with $f_0(p_1) \neq 0$. Then $V_1 \not\subseteq V_0$, and the variety $V_0 \cap V_1$ is of
lower dimension than $V_1$. That is, generic points on $V_1$ lie off of $V_0$.

Among the most important and elementary phylogenetic invariants are the ones that 
arise from edge flattenings of tensors. 

Definition 3.3. Let $A|B$ be a split compatible with the tree $T$. An edge invariant for
$T$ is a phylogenetic invariant that can be expressed as a minor (i.e., the determinant of a
submatrix) of the matrix $\operatorname{Flat}_{A|B}(P)$.

As an indication of how edge invariants can be used to identify combinatorial infor- 
mation on the tree underlying a phylogenetic model, we recall the following theorem 
concerning models on a single tree. While this statement is well-known in the phylogenetic
invariants literature, Lemma 4.1 of this article provides a more general extension
to mixture models. 

Theorem 3.4. Suppose that $T_0$ and $T_1$ are two $n$-leaf trees such that for $i = 0, 1$, $A_i|B_i$
is a split compatible with $T_i$ and incompatible with $T_{1-i}$, and let $\mathcal{M}_{T_i}$ denote the $\kappa$-state
general Markov model $\mathrm{GM}(\kappa)$ on $T_i$. Then the $(\kappa + 1)$-minors of $\operatorname{Flat}_{A_i|B_i}(P)$ vanish on
$\mathcal{M}_{T_i}$ and do not vanish on $\mathcal{M}_{T_{1-i}}$, and thus are edge invariants for the first model but
not the second. In particular, edge invariants can be used to generically identify the tree
topology.
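
Vanishing of all $(\kappa + 1)$-minors of a flattening is the same as the flattening having rank at most $\kappa$, so the theorem can be illustrated by comparing ranks. The sketch below assumes $\kappa = 2$ and a 4-leaf tree with split $\{1,2\}|\{3,4\}$, with arbitrarily chosen nonsingular Markov matrices; all names are hypothetical and leaves are indexed from 0 in the code.

```python
from fractions import Fraction as F
from itertools import product


def rank(rows):
    """Rank via Gaussian elimination over the rationals."""
    A = [[F(x) for x in row] for row in rows]
    rk = 0
    for col in range(len(A[0])):
        pivot = next((i for i in range(rk, len(A)) if A[i][col] != 0), None)
        if pivot is None:
            continue
        A[rk], A[pivot] = A[pivot], A[rk]
        for i in range(rk + 1, len(A)):
            factor = A[i][col] / A[rk][col]
            A[i] = [x - factor * y for x, y in zip(A[i], A[rk])]
        rk += 1
    return rk


def quartet_distribution(root, Me, leaf):
    """GM(kappa) on the tree 12|34: node a joins leaves 1, 2; node b joins 3, 4.

    root: distribution at a; Me: Markov matrix on edge (a, b);
    leaf: four Markov matrices on the pendant edges.
    """
    k = len(root)
    return {
        x: sum(root[a] * Me[a][b]
               * leaf[0][a][x[0]] * leaf[1][a][x[1]]
               * leaf[2][b][x[2]] * leaf[3][b][x[3]]
               for a in range(k) for b in range(k))
        for x in product(range(k), repeat=4)
    }


def flat_matrix(P, k, A, B):
    """Matrix Flat_{A|B}(P); A and B are tuples of 0-based leaf positions."""
    def entry(r, c):
        x = [None] * (len(A) + len(B))
        for pos, s in zip(A, r):
            x[pos] = s
        for pos, s in zip(B, c):
            x[pos] = s
        return P[tuple(x)]
    rows = list(product(range(k), repeat=len(A)))
    cols = list(product(range(k), repeat=len(B)))
    return [[entry(r, c) for c in cols] for r in rows]


# Illustrative parameters: every matrix below is nonsingular.
root = [F(2, 5), F(3, 5)]
Me = [[F(9, 10), F(1, 10)], [F(1, 5), F(4, 5)]]
leaf = [
    [[F(4, 5), F(1, 5)], [F(1, 5), F(4, 5)]],
    [[F(7, 10), F(3, 10)], [F(1, 10), F(9, 10)]],
    [[F(3, 4), F(1, 4)], [F(1, 3), F(2, 3)]],
    [[F(9, 10), F(1, 10)], [F(2, 5), F(3, 5)]],
]
P = quartet_distribution(root, Me, leaf)

rank_compatible = rank(flat_matrix(P, 2, (0, 1), (2, 3)))    # split 12|34
rank_incompatible = rank(flat_matrix(P, 2, (0, 2), (1, 3)))  # split 13|24
print(rank_compatible, rank_incompatible)
```

The compatible flattening factors through the two states on the internal edge, which bounds its rank by $\kappa$; the incompatible one has no such factorization and here attains full rank.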

Edge invariants have been the phylogenetic invariants most interesting for tree identifi- 
ability in the past, and contain enough information to reconstruct the combinatorial type 
of a single tree in some situations. However, we need some more complicated invariants to 
get more information in the case of the phylogenetic mixture models considered here. We 
describe these invariants, discovered in several different contexts [3 [20], in matrix form. 

Theorem 3.5. Let $P$ be a $\kappa \times \kappa \times \kappa$ tensor giving a distribution from the $\mathrm{GM}(\kappa)$ model
on a 3-leaf tree. For $i = 1, \ldots, \kappa$, let $P_{(i)}$ be the matrix slice $P_{(i)} = (P(i, u, v))_{u,v}$. Then

$$P_{(i)} (\operatorname{adj} P_{(j)}) P_{(k)} - P_{(k)} (\operatorname{adj} P_{(j)}) P_{(i)} = 0.$$

Here $\operatorname{adj} A$ denotes the classical adjoint of $A$, which is given by polynomial expressions
in the entries of $A$. In the case of nonsingular $A$, $\operatorname{adj}(A) = \det(A) A^{-1}$.
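
These invariants can be verified exactly on an example. The sketch below assumes $\kappa = 3$ and a 3-leaf star tree rooted at its central node, with arbitrarily chosen Markov matrices; the helper names are hypothetical. It also shows that three arbitrary matrices need not satisfy the same relation.

```python
from fractions import Fraction as F


def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def minor(A, i, j):
    """Matrix A with row i and column j removed."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(A) if r != i]


def det(A):
    """Determinant by Laplace expansion along the first row."""
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j)) for j in range(len(A)))


def adj(A):
    """Classical adjoint: transpose of the cofactor matrix."""
    n = len(A)
    return [[(-1) ** (i + j) * det(minor(A, j, i)) for j in range(n)]
            for i in range(n)]


def invariant(Pi, Pj, Pk):
    """Matrix Pi adj(Pj) Pk - Pk adj(Pj) Pi."""
    left = matmul(matmul(Pi, adj(Pj)), Pk)
    right = matmul(matmul(Pk, adj(Pj)), Pi)
    return [[l - r for l, r in zip(lrow, rrow)] for lrow, rrow in zip(left, right)]


def slices(root, M1, M2, M3):
    """Slices P_(i)[u][v] = sum_a root[a] * M1[a][i] * M2[a][u] * M3[a][v]."""
    k = len(root)
    return [[[sum(root[a] * M1[a][i] * M2[a][u] * M3[a][v] for a in range(k))
              for v in range(k)] for u in range(k)] for i in range(k)]


def tenths(*xs):
    return [F(x, 10) for x in xs]


# Illustrative parameters for a 3-state model on a 3-leaf star tree.
root = tenths(5, 3, 2)
M1 = [tenths(7, 2, 1), tenths(1, 8, 1), tenths(2, 2, 6)]
M2 = [tenths(6, 3, 1), tenths(2, 7, 1), tenths(1, 1, 8)]
M3 = [tenths(8, 1, 1), tenths(1, 6, 3), tenths(3, 1, 6)]
S = slices(root, M1, M2, M3)

zero = [[0] * 3 for _ in range(3)]
all_vanish = all(invariant(S[i], S[j], S[k]) == zero
                 for i in range(3) for j in range(3) for k in range(3))
print(all_vanish)

# Arbitrary slices (not from the model) can violate the relation.
A = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
B = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]
print(invariant(A, I3, B) == zero)
```

For model tensors each slice has the form $M_2^T D_i M_3$ with $D_i$ diagonal, and the relation reduces to the fact that diagonal matrices commute.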

4. Identifiability of Mixture Models with Common Substructure

In this section, we prove our main result, that both tree parameters and numerical pa- 
rameters are generically identifiable in a phylogenetic mixture model provided we restrict 
to multisets T of trees that all share a certain substructure. More precisely, we require 
that all trees in T have two splits in common. The number of mixing components that 
can be identified via our techniques will depend on the sizes of the sets in these splits. As 
a corollary, we deduce Theorem 1.1 after showing that if all trees are the same, there is
a "deep" internal vertex with two of its incident edges giving the requisite splits. 

Before proceeding to the statements and proofs of the main theorems, we prove three 
lemmas.

Lemma 4.1. (Edge invariants for tree mixtures)

Consider the $\mathrm{GM}(\kappa)$ mixture model on $r$ trees $\mathbf{T} = (T_1, \ldots, T_r)$. Let $A|B$ be a bipartition
of the taxa, with $r < \min(\kappa^{\#A - 1}, \kappa^{\#B - 1})$.

(1) If $A|B$ is compatible with all trees in $\mathbf{T}$, then all $(r\kappa + 1)$-minors of $\operatorname{Flat}_{A|B}(P)$
vanish for all distributions $P$ arising from the model.

(2) If $A|B$ is not compatible with at least one tree in $\mathbf{T}$, then for generic distributions
$P$ arising from the model at least one $(r\kappa + 1)$-minor of $\operatorname{Flat}_{A|B}(P)$ does not vanish.

Proof. The claims concerning (non)vanishing of minors are equivalent to claims that
$\operatorname{Flat}_{A|B}(P)$ has rank at most $r\kappa$ in case (1), and generically has rank greater than $r\kappa$
in case (2). Therefore we focus on investigating ranks of flattenings.

If $A|B$ is compatible with all trees in $\mathbf{T}$, then, by passing to binary resolutions of the
$T_i$, we may assume it is a split associated to edge $e_i = (a_i, b_i)$ in $T_i$. Then one sees that

$$\operatorname{Flat}_{A|B}(P) = M_A^T Q M_B.$$

Here $Q$ is the $r\kappa \times r\kappa$ block-diagonal matrix whose $i$th $\kappa \times \kappa$ block gives the joint probability
distribution of states for the random variables at $a_i$ and $b_i$, weighted by the component
proportion $\pi_i$. The matrices $M_A$, $M_B$ are stochastic, of sizes $r\kappa \times \kappa^{\#A}$, $r\kappa \times \kappa^{\#B}$, with entries
in the $i$th block of $\kappa$ rows giving probabilities of states of variables in $A$, $B$ conditioned
on states at $a_i$, $b_i$. This factorization implies the claimed bound on the rank.

Suppose next that $A|B$ is not compatible with at least one of the trees in $\mathbf{T}$, say $T_1$. To
show that $\operatorname{Flat}_{A|B}(P)$ generically has rank greater than $r\kappa$, it is enough to give a single
choice of parameters producing such a rank. Indeed, this follows from Proposition 3.2
applied to the model and the variety of matrices of rank at most $r\kappa$.

To simplify this choice, for each $T_i$ with $i > 1$ choose all Markov matrices for all internal
edges of $T_i$ to be the identity, $I_\kappa$. Since $T_1$ is not compatible with $A|B$, by Theorem 3.8.6
of [19], it has an edge $e = (c, d)$, with associated split $C|D$, such that all four sets $A \cap C$,
$A \cap D$, $B \cap C$, $B \cap D$ are nonempty. For all internal edges of $T_1$ except $e$, choose Markov
matrices to be $I_\kappa$ as well. Since the effect of an identity matrix on an edge is the same as
contracting that edge, with these choices we need henceforth argue only in the following
special case: for $i > 1$, $T_i$ is a star tree with central node $a_i$, and $T_1$ has the form of two
star trees, on $C$ and on $D$, that are joined at their central nodes by $e$.

Now express the distribution $P = P_1 + P'$ where $P_1$ is the mixture component from $T_1$,
and $P'$ the sum of the components on the star trees $T_2 = \cdots = T_r$. Then, one can write

$$M_2 := \operatorname{Flat}_{A|B}(P') = N_A^T R N_B,$$

with $R$ an $(r-1)\kappa \times (r-1)\kappa$ diagonal matrix giving the distribution of states at $a_i$
in components $2, \ldots, r$ weighted by the $\pi_i$, and $N_A$, $N_B$ stochastic matrices of sizes
$(r-1)\kappa \times \kappa^{\#A}$, $(r-1)\kappa \times \kappa^{\#B}$ with entries giving conditional probabilities of states of
variables in $A$, $B$ conditioned on states/components at the $a_i$. By choosing positive root
distributions at the nodes $a_i$ and positive $\pi_i$, we ensure $R$ will have positive diagonal
entries, and hence have full rank. Furthermore, the rows of $N_A$, $N_B$ are formed from the
tensor product of corresponding rows of the Markov matrices on the edges of the star trees,
so $N_A$ and $N_B$ are generalized Vandermonde matrices. (Recall that if $f_1, \ldots, f_t$ are a linearly
independent set of polynomials, and $u_1, \ldots, u_s$ are points, the generalized Vandermonde
matrix is the $s \times t$ matrix with $(i, j)$ entry $f_j(u_i)$. Here the polynomials $f_j$ are
determined by the formulae for the entries in the tensor product of the rows, and the $u_i$
by the entries in the Markov matrices.) A generalized Vandermonde matrix has full rank
for generic choices of $u_1, \ldots, u_s$. Since $(r-1)\kappa < \min(\kappa^{\#A}, \kappa^{\#B})$, for generic parameters
$M_2$ has rank $(r-1)\kappa$.

On the other hand, consider $P_1$, where we choose all matrices on pendant edges of $T_1$
to be $I_\kappa$, and both the root distribution at $c$ and $M_e$ to have all positive entries. Then

$$M_1 := \operatorname{Flat}_{A|B}(P_1) = {N'_A}^{T} R_1 N'_B,$$

where $R_1$ is a $\kappa^2 \times \kappa^2$ diagonal matrix with entries giving the joint distribution at $c$ and
$d$ weighted by $\pi_1$, and $N'_A$, $N'_B$ have all zero entries except for a single 1 in each row,
and full row rank. Thus $M_1$ has rank $\kappa^2$. Moreover, it has at most one non-zero entry in
each row and column, so both $\operatorname{im}(M_1)$ and $\ker(M_1)$ are coordinate subspaces.

Since $\operatorname{Flat}_{A|B}(P) = M_1 + M_2$, our goal is to show that $\operatorname{rank}(M_1 + M_2) > r\kappa$ for generic
choices of the parameters not yet specified (the Markov matrices on the trees $T_2, \ldots, T_r$).
Without loss of generality assume that $\#A \ge \#B$, so to do this it is enough to make

$$\operatorname{rank}(M_1 + M_2) = \min((r-1)\kappa + \kappa^2, \kappa^{\#B}). \tag{1}$$

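We use the following facts about matrices: Let $X$ and $Y$ be $s \times t$ matrices. With
$\operatorname{im}(X)$, $\ker(X)$ denoting the image and kernel of $X$ as a linear transformation from $\mathbb{R}^t$ to
$\mathbb{R}^s$, then $\operatorname{im}(X) \cap \operatorname{im}(Y) = \{0\}$ implies $\ker(X + Y) = \ker X \cap \ker Y$. Also, if $\operatorname{nullity}(X + Y) =
\operatorname{nullity}(X) + \operatorname{nullity}(Y) - t$, then by the rank/nullity theorem $\operatorname{rank}(X + Y) = \operatorname{rank}(X) +
\operatorname{rank}(Y)$.

First consider the case where $(r-1)\kappa + \kappa^2 \le \kappa^{\#B}$. By the preceding paragraph, to
show equation (1) it suffices to choose parameters so that $\operatorname{im}(M_1) \cap \operatorname{im}(M_2) = \{0\}$ and
$\dim(\ker(M_1) \cap \ker(M_2)) = \operatorname{nullity}(M_1) + \operatorname{nullity}(M_2) - \kappa^{\#B}$.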

Since generically $N_A$ and $N_B$ have full rank, it follows that $\operatorname{im}(M_2) = \operatorname{im}(N_A^T)$ and
$\ker(M_2) = \ker(N_B)$. But $\operatorname{im}(M_1)$ is a coordinate subspace, so it intersects $\operatorname{im}(N_A^T)$ nontrivially
if and only if the submatrix of $N_A^T$ obtained by deleting rows corresponding to those
coordinates has nontrivial kernel. That submatrix is a $(\kappa^{\#A} - \kappa^2) \times (r-1)\kappa$ generalized
Vandermonde matrix with $\kappa^{\#A} - \kappa^2 \ge \kappa^{\#B} - \kappa^2 \ge (r-1)\kappa$, so it has full column rank.
This proves that $\operatorname{im}(M_1) \cap \operatorname{im}(M_2) = \{0\}$ generically.

Since $\ker(M_1)$ is also a coordinate subspace, its intersection with $\ker(M_2) = \ker(N_B)$
is isomorphic to the kernel of the submatrix of $N_B$ obtained by deleting the columns
corresponding to required zero entries in vectors in $\ker(M_1)$. Since this submatrix is a
$(r-1)\kappa \times (\kappa^{\#B} - \kappa^2)$ generalized Vandermonde matrix, the dimension of this kernel is

$$\kappa^{\#B} - \kappa^2 - (r-1)\kappa = (\kappa^{\#B} - \kappa^2) + (\kappa^{\#B} - (r-1)\kappa) - \kappa^{\#B}.$$

Thus $\dim(\ker(M_1) \cap \ker(M_2)) = \operatorname{nullity}(M_1) + \operatorname{nullity}(M_2) - \kappa^{\#B}$, so $\operatorname{rank}(M_1 + M_2) =
(r-1)\kappa + \kappa^2$.

In the case where $(r-1)\kappa + \kappa^2 > \kappa^{\#B}$ the same arguments as above apply after modifying
our choices so all but $\kappa^{\#B} - (r-1)\kappa$ of the entries of $R_1$ are zero. Then we deduce that
we can choose $M_2$ so that $\operatorname{rank}(M_1 + M_2) = (r-1)\kappa + \kappa^{\#B} - (r-1)\kappa = \kappa^{\#B}$. □

Picking any internal vertex of a binary tree, the induced tripartition of the leaf variables
allows us to create 3 agglomerate variables. In this way, we can view a phylogenetic model
as one to which we can apply Kruskal's theorem. More specifically, consider a probability
distribution $P$ in the $\mathrm{GM}(\kappa)$ mixture model on trees $\mathbf{T} = (T_1, \ldots, T_r)$, where the $T_i$ share
a common tripartition $A|B|C$ of the leaves, arising from the vertices $v_i$. Suppose $P_i$ is
the weighted mixture component from $T_i$ in $P$. Then from the parameters on $T_i$, one
can give $\kappa \times \kappa^{\#A}$, $\kappa \times \kappa^{\#B}$, $\kappa \times \kappa^{\#C}$ stochastic matrices $M_{i,A}$, $M_{i,B}$, $M_{i,C}$ of conditional
probabilities of states at the leaves in $A$, $B$, $C$, given the state at $v_i$. Letting $\tilde{M}_{i,A}$ be the
matrix obtained from $M_{i,A}$ by multiplying rows by the corresponding entry of the root
distribution at $v_i$ and by the weight $\pi_i$, one checks that

$$\operatorname{Flat}_{A|B|C}(P_i) = [\tilde{M}_{i,A}, M_{i,B}, M_{i,C}].$$

Let $M_A$ denote the $r\kappa \times \kappa^{\#A}$ matrix obtained by stacking the $\tilde{M}_{i,A}$, and let $M_B$, $M_C$ similarly
be matrices obtained by stacking the $M_{i,B}$, $M_{i,C}$. Then

$$\operatorname{Flat}_{A|B|C}(P) = [M_A, M_B, M_C].$$

To apply Kruskal's theorem to this flattening, we must first show that the technical 
conditions on Kruskal rank of the matrices apply, at least generically. 

Lemma 4.2. Consider an $r$-fold $\mathrm{GM}(\kappa)$ mixture model on trees $\mathbf{T} = (T_1, \ldots, T_r)$ with a
common tripartition $A|B|C$ of the leaves. Then

$$\operatorname{Flat}_{A|B|C}(P) = [M_A, M_B, M_C]$$

for some matrices $M_A$, $M_B$, $M_C$ with $r\kappa$ rows. Moreover, for generic choices of the numerical
parameters these matrices all have full Kruskal rank (i.e., Kruskal row rank equal
to their smaller dimension).

Proof. The first claim was established in the discussion preceding the lemma. 

For the second, by similar reasoning as was used in Lemma 4.1 it is enough to show one
choice of parameters gives these matrices full Kruskal rank. By choosing matrix parameters
on all internal edges of every $T_i$ to be the identity matrix, we may essentially assume
every $T_i$ is the star tree, rooted at central node $v_i$. Choosing positive root distributions at
$v_i$, and positive mixing parameters $\pi_i$, it then suffices to only consider one set of leaves,
say $A$.

Now, as in the discussion of $N_A$ in the proof of Lemma 4.1, one sees that $M_A$ is a generalized
Vandermonde matrix. Since all its submatrices are also generalized Vandermonde
matrices, it generically has full Kruskal rank. □

The next lemma allows us to tease apart distributions which arise from mixing together 
slices of distributions from different trees. After we have applied Kruskal's Theorem via 
Lemma 4.2, it will be used to identify which rows of the matrices arise from the same
mixture component of the model. 

Lemma 4.3 (No Shuffling Lemma). Let $T, T_1, \ldots, T_r$ be trees with $n \ge 3$ leaves, or
$n \ge 4$ leaves if $\kappa = 2$. For $i = 1, \ldots, r$, let $P_i$ be a generic probability distribution from
the $\mathrm{GM}(\kappa)$ model on the tree $T_i$, scaled by positive constants $\pi_i$. For a fixed choice of
$j \in [n]$, let $A|B = \{j\}|([n] \setminus \{j\})$ and form the flattenings $\operatorname{Flat}_{A|B}(P_i)$. Form a new
matrix from any $\kappa$ rows from these flattenings (with repeats allowed), and define $Q$ so
that $\operatorname{Flat}_{A|B}(Q)$ is this matrix. Then $Q$ does not satisfy all the phylogenetic invariants for
$T$ unless the chosen rows come from a single $P_i$ and $T$ is a refinement of $T_i$.

Proof. Note that the multiplication by the $\pi_i$ has no effect on whether the tensor satisfies
non-trivial invariants, because the phylogenetic varieties for the $\mathrm{GM}(\kappa)$ model are
invariant under the action of the general linear group at any leaf [8].

Consider first the case that n = 3, and k > 3. Suppose Q is constructed from rows 
which come from at least two different Pj. Without loss of generality, we assume j = 1, so 
that in the notation of Theorem 13.51 the slices Qu\ contain the entries of Q arising from 
a single row of the flattening. We will show that Q does not satisfy the invariants of that 
theorem. 

For the time being, treat two of these slices $Q_{(1)}$, $Q_{(2)}$ as fixed, and the third
slice $Q_{(3)}$, which we may assume does not come from the same $P_i$ as either $Q_{(1)}$ or
$Q_{(2)}$, as a variable. Generically, the matrix equation

(2) $Q_{(1)} (\operatorname{adj} Q_{(2)}) Q_{(3)} - Q_{(3)} (\operatorname{adj} Q_{(2)}) Q_{(1)} = 0$

then gives nonzero, linear constraints on the entries of $Q_{(3)}$.

However, for an arbitrary matrix $Q_{(3)}$ with positive entries whose sum is less than 1, we
can find a $P_i$ that has $Q_{(3)}$ as any designated slice. This shows that there exist such
slices not satisfying equation (2), and hence, by Proposition 3.2, that the generic slice
does not.
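
The role of equation (2) can be illustrated numerically. The sketch below is an illustration only, with $k = 3$ and made-up parameter values; all function and variable names are assumptions introduced here. It builds the three slices of a 3-leaf GM(3) tensor as $M_2^T \operatorname{diag}(\pi_s M_1(s, i)) M_3$ and checks that the left side of (2) vanishes exactly, then replaces the third slice by an arbitrary positive matrix with entry sum less than 1 (standing in for a slice from a different distribution) and checks that it no longer vanishes.

```python
from fractions import Fraction as F

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def adjugate3(X):
    """Adjugate of a 3x3 matrix (transpose of the cofactor matrix)."""
    cof = [[X[(i + 1) % 3][(j + 1) % 3] * X[(i + 2) % 3][(j + 2) % 3]
            - X[(i + 1) % 3][(j + 2) % 3] * X[(i + 2) % 3][(j + 1) % 3]
            for j in range(3)] for i in range(3)]
    return transpose(cof)

def stochastic(rows):
    """Normalize integer rows so that each row sums to 1."""
    return [[F(v, sum(row)) for v in row] for row in rows]

def slice_at_leaf1(i, pi, M1, M2, M3):
    """Slice of the 3-leaf tensor with the state at leaf 1 fixed to i."""
    D = [[pi[s] * M1[s][i] if s == t else F(0) for t in range(3)] for s in range(3)]
    return matmul(matmul(transpose(M2), D), M3)

def commutator(Q1, Q2, Q3):
    """Left side of equation (2): Q1 adj(Q2) Q3 - Q3 adj(Q2) Q1."""
    A = matmul(matmul(Q1, adjugate3(Q2)), Q3)
    B = matmul(matmul(Q3, adjugate3(Q2)), Q1)
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Hypothetical toy parameters: root distribution and one Markov matrix per leaf.
pi = [F(1, 2), F(1, 3), F(1, 6)]
M1 = stochastic([[5, 1, 1], [1, 6, 2], [2, 1, 7]])
M2 = stochastic([[4, 1, 2], [1, 5, 1], [1, 2, 6]])
M3 = stochastic([[6, 1, 1], [2, 7, 1], [1, 1, 5]])
Q1, Q2, Q3 = (slice_at_leaf1(i, pi, M1, M2, M3) for i in range(3))

# A positive matrix with entry sum 36/60 < 1, playing the role of a foreign slice.
foreign = [[F((a + 1) * (b + 1), 60) for b in range(3)] for a in range(3)]

own_ok = all(v == 0 for row in commutator(Q1, Q2, Q3) for v in row)
shuffled_ok = all(v == 0 for row in commutator(Q1, Q2, foreign) for v in row)
print(own_ok, shuffled_ok)
```

Exact rational arithmetic is used so that "zero" means exactly zero rather than zero up to rounding.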

When $k = 2$ and $n = 3$, there are no non-trivial invariants for GM($k$) (those of Theorem
3.5 are identically zero), hence we consider $n = 4$, and use the edge invariants of Theorem
3.4. But for any choice of 4-leaf tree, and choice of index $j \in \{1, 2\}$, we can find a
$P_i$ in the tree model so that $\mathrm{Flat}_{A|B}(P_i)$ has any desired generic vector as
its $j$th row. Now $Q$ is built from two such rows. If the $P_1$ and $P_2$ that we take these
slices from are not the same, then generically, we can choose those slices to be arbitrary
vectors. But then the flattening of $Q$ with respect to the split of $T$ will generically be
a rank 4 matrix, and hence $Q$ will not satisfy the invariants for tree $T$.

For larger n, the result follows from the cases above by marginalization to 3- or 4-leaf 
trees. □ 

First we prove a theorem on the generic identifiability of numerical parameters in trees 
with a known common tripartition. 

Theorem 4.4. Suppose the trees $\mathbf{T} = (T_1, \dots, T_r)$ have a known common
tripartition $A|B|C$, with $\#A \ge \#B \ge \#C$, and $r < k^{\#B-1}$. If $k = 2$ also suppose
$\#A \ge 3$. Then both $\mathbf{T}$ and the numerical parameters of the GM($k$) mixture model
on $\mathbf{T}$ are generically identifiable.

Proof. Since the trees in $\mathbf{T}$ share a common tripartition $A|B|C$, by Lemma 4.2, if a
distribution $P$ arises from generic parameters of the model then

$\mathrm{Flat}_{A|B|C}(P) = [M_A, M_B, M_C]$,

where $M_A$, $M_B$, and $M_C$ all have full Kruskal row rank, which will be
$\min(rk, k^{\#A})$, $\min(rk, k^{\#B})$, and $\min(rk, k^{\#C})$, respectively. According to
Theorem 3.1, these matrices are uniquely determined up to simultaneous permutation and
scaling of the rows provided

(3) $\min(rk, k^{\#A}) + \min(rk, k^{\#B}) + \min(rk, k^{\#C}) \ge 2rk + 2$.

Since $k \ge 2$ and $\#C \ge 1$, this inequality holds for all $r < k^{\#B-1}$: in that case
$rk < k^{\#B} \le k^{\#A}$, so the first two minima equal $rk$, and the third is at least 2.
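
As a sanity check on inequality (3), one can enumerate small cases. The sketch below is illustrative only: the function names and the search ranges are assumptions. It confirms that (3) holds whenever $r < k^{\#B-1}$ with $\#A \ge \#B \ge \#C \ge 1$, and shows one case outside that range where it fails.

```python
def kruskal_condition_holds(r, k, a, b, c):
    """Inequality (3), with a, b, c standing for #A, #B, #C."""
    rk = r * k
    return min(rk, k ** a) + min(rk, k ** b) + min(rk, k ** c) >= 2 * rk + 2

def holds_below_bound(k_max=4, size_max=5):
    """Check (3) for all k <= k_max, size_max >= #A >= #B >= #C >= 1, 1 <= r < k^(#B-1)."""
    for k in range(2, k_max + 1):
        for a in range(1, size_max + 1):
            for b in range(1, a + 1):
                for c in range(1, b + 1):
                    for r in range(1, k ** (b - 1)):
                        if not kruskal_condition_holds(r, k, a, b, c):
                            return False
    return True

print(holds_below_bound())
# Outside the hypothesis: k = 2, #A = #B = 2, #C = 1, r = 3 gives 4 + 4 + 2 < 14.
print(kruskal_condition_holds(3, 2, 2, 2, 1))
```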

At this point, we have recovered the matrices $M_A$, $M_B$, and $M_C$ up to scaling and
permuting the rows. Each of the rows of the recovered $M_A$ will have entries from a scaled
slice from a tree distribution on a subtree of one of the $T_i$ (the subtree spanning the
vertex $v_i$ and all the leaves $A$). We need to group these rows by the mixture components
they come from. However, the No Shuffling Lemma 4.3 says that generically it is possible to
do this. Since ordering the rows of $M_A$ determines an order of the rows of $M_B$, $M_C$, we
can then reassemble the flattened scaled mixture components $\pi_i P_i$ as the product
$[M_A^i, M_B^i, M_C^i]$ of appropriate submatrices $M_A^i$, $M_B^i$, $M_C^i$ of $M_A$, $M_B$,
$M_C$.

From $\pi_i P_i$, we recover the mixing weight $\pi_i$ as the sum of its entries, since the
entries of the probability distribution $P_i$ sum to 1. Then, by Theorem 2.1, the tree $T_i$
and the numerical parameters on it can be identified from $P_i = (\pi_i P_i)/\pi_i$. □

Now we proceed to prove identifiability of the numerical parameters and tree parameters 
in our most general class of r-tree mixture models, the j-deep class. 

Definition 4.5. For a positive integer $j$, the $j$-deep class of $r$-tuples of trees
$\mathbf{T}$ consists of all $r$-tuples of binary trees such that there exists a tripartition
$A|B|C$ with $\#A, \#B \ge j$, $\#C \ge 1$, such that the splits $A|B \cup C$ and
$A \cup C|B$ are compatible with all trees in $\mathbf{T}$.

Note that this definition does not require that $C|A \cup B$ be compatible with any of the
trees in $\mathbf{T}$, so the full tripartition need not be associated to vertices in the
$T_i$. The trees must only share two splits, each sufficiently deep in the tree. Furthermore,
if $\mathbf{T}$ is in the $j$-deep class, we do not assume the tripartition is known, only
that it exists.

We now prove our main theorems on identifiability of parameters in r-tree mixtures. 
We state two versions, one for when a j-deep tripartition is known (including the case of 
when all the trees are known), and one for when it is not. The second of these requires a 
slightly stronger hypothesis on the number of mixture components. 

Theorem 4.6. Suppose $\mathbf{T}$ is in the $j$-deep class via a known tripartition $A|B|C$.
Then both $\mathbf{T}$ and the numerical parameters of the GM($k$) mixture model associated
to $\mathbf{T}$ are generically identifiable provided $r < k^{j-1}$ and either $k > 2$, or
$k = 2$ and $\#A \ge 3$.

Proof. Fix some $c \in C$, let $D(c) = A \cup B \cup \{c\}$, and let $P_c = P|_{D(c)}$ be the
marginalization of $P$ to the leaves in $D(c)$. This is a probability tensor for the mixture
of induced trees $\mathbf{T}|_{D(c)}$, with numerical parameters obtained by restricting to
these induced trees. Note that the trees in $\mathbf{T}|_{D(c)}$ share the common
tripartition $A|B|\{c\}$. Thus Theorem 4.4 applies to identify the trees
$\mathbf{T}|_{D(c)}$ and numerical parameters on them. Then by Lemma 4.2 we may write

$\mathrm{Flat}_{A|B|\{c\}}(P_c) = [M_A, M_B, M_C]$,

and for generic choices of the numerical parameters, these matrices all have full Kruskal
row rank. We may further specify that the rows of these matrices, in particular $M_A$, have
been ordered into $r$ blocks of $k$ rows, corresponding to the various mixture components.

Note that since the matrix $M_A$ has full Kruskal row rank and is $rk \times k^{\#A}$ with
$rk < k^{\#A}$, it has full row rank. Thus we may compute a right inverse $Q_A$, with
$M_A Q_A = I_{rk}$ the $rk \times rk$ identity.

Returning to the consideration of the full distribution $P$ and trees $\mathbf{T}$, we use
$Q_A$ to disentangle the mixture components. In each $T_i$ let $w_i$ be the node in the
subtree spanning $A$ through which this subtree is connected to all other leaves. Then

$\mathrm{Flat}_{B \cup C|A}(P) = \tilde{M}_{B \cup C}^T \, \Pi \, \tilde{M}_A$,

where $\tilde{M}_A$, $\tilde{M}_{B \cup C}$ are stochastic matrices of probabilities of states
at the leaves in $A$, $B \cup C$ conditioned on components and states at the $w_i$, and $\Pi$
is a diagonal matrix with entries the product of the mixing weights, $\pi_i$, and the root
distributions at $w_i$. While the ordering of the mixture components and root states in these
matrices is arbitrary, we may assume it is the same as in the rows of $M_A$. Then

$M_A = R \tilde{M}_A$,

where $R$ is a block diagonal matrix whose $i$th block gives conditional probabilities of
state changes from $v_i$ to $w_i$ on $T_i$, and is generically invertible.
Thus

$\mathrm{Flat}_{B \cup C|A}(P) \, Q_A = \tilde{M}_{B \cup C}^T \, \Pi R^{-1} M_A Q_A = \tilde{M}_{B \cup C}^T \, \Pi R^{-1}$.

This shows that by taking the columns of $\mathrm{Flat}_{B \cup C|A}(P) \, Q_A$ in blocks of
$k$ we obtain entries associated to only one mixture component at a time. Moreover,
multiplying a block of these columns by the corresponding block of rows of
$M_A = R \tilde{M}_A$, we obtain a flattened form of a single mixture component $\pi_i P_i$.

Summing the entries of $\pi_i P_i$ identifies $\pi_i$, and hence $P_i$. Then by Theorem 2.1
the tree $T_i$ and the numerical parameters on it are identifiable. □
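
The linear algebra of this disentangling step can be traced on a toy example. The sketch below is not taken from actual tree models: it assumes $r = 2$, $k = 2$, and uses arbitrary made-up stochastic matrices in the roles of $\tilde{M}_A$, $\tilde{M}_{B \cup C}$ and $R$ (all names are illustrative). It forms the mixture flattening, computes a matrix $Q_A$ with $M_A Q_A = I$ as $M_A^T (M_A M_A^T)^{-1}$, and recovers each scaled component and its mixing weight.

```python
from fractions import Fraction as F

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def diag(vals):
    return [[v if i == j else F(0) for j in range(len(vals))] for i, v in enumerate(vals)]

def stochastic(rows):
    return [[F(v, sum(row)) for v in row] for row in rows]

def inverse(X):
    """Gauss-Jordan inverse over exact fractions (X square and invertible)."""
    n = len(X)
    aug = [list(row) + [F(int(i == j)) for j in range(n)] for i, row in enumerate(X)]
    for c in range(n):
        p = next(i for i in range(c, n) if aug[i][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for i in range(n):
            if i != c and aug[i][c] != 0:
                f = aug[i][c]
                aug[i] = [v - f * w for v, w in zip(aug[i], aug[c])]
    return [row[n:] for row in aug]

r, k = 2, 2
# Hypothetical stand-ins: rows indexed by (component, state at w_i).
Mt_A = stochastic([[(j + 1) ** p for j in range(8)] for p in range(4)])   # rk x k^#A
Mt_BC = stochastic([[1, 2, 3, 4], [4, 3, 2, 1], [1, 1, 1, 5], [2, 5, 1, 2]])
weights = [F(1, 3), F(2, 3)]
roots = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]
pi_diag = [w * p for w, root in zip(weights, roots) for p in root]
R = [[F(3, 4), F(1, 4), 0, 0], [F(1, 4), F(3, 4), 0, 0],
     [0, 0, F(2, 3), F(1, 3)], [0, 0, F(1, 5), F(4, 5)]]

flat_mix = matmul(matmul(transpose(Mt_BC), diag(pi_diag)), Mt_A)  # mixture flattening
M_A = matmul(R, Mt_A)                                             # as given by Theorem 4.4
Q_A = matmul(transpose(M_A), inverse(matmul(M_A, transpose(M_A))))
G = matmul(flat_mix, Q_A)                                         # equals Mt_BC^T Pi R^-1

def component(i):
    """Block i of the columns of G times block i of the rows of M_A."""
    cols = [row[k * i:k * (i + 1)] for row in G]
    return matmul(cols, M_A[k * i:k * (i + 1)])

def expected(i):
    """Flattening of pi_i P_i computed directly from the toy parameters."""
    blk = slice(k * i, k * (i + 1))
    return matmul(matmul(transpose(Mt_BC[blk]), diag(pi_diag[blk])), Mt_A[blk])

recovered_weights = [sum(map(sum, component(i))) for i in range(r)]
print(recovered_weights)
```

In the proof, $M_A$ comes from applying Theorem 4.4 to the marginalization $P_c$; here it is simply computed from the chosen $R$ so that the recovery can be compared against the known components.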

Theorem 4.7. Suppose $\mathbf{T}$ is in the $j$-deep class. Then both $\mathbf{T}$ and the
numerical parameters of the GM($k$) mixture model associated to $\mathbf{T}$ are generically
identifiable provided $r < k^{j-1}$.

Proof. Since $\mathbf{T}$ is in the $j$-deep class and $rk < k^{\#A}, k^{\#B}$, for generic
parameters we can use the edge invariants of Lemma 4.1 to find two splits $A|B \cup C$ and
$B|A \cup C$ compatible with all trees in $\mathbf{T}$, with $\#A \ge \#B \ge j$,
$\#C \ge 1$, simply by testing for all splits of an appropriate size.

If $k = 2$, then $2 \le r < k^{j-1}$ implies $j \ge 3$, so $\#A \ge 3$. Thus for any
$k \ge 2$, Theorem 4.6 applies to give the conclusion. □

We are now in a position to deduce Theorem 1.1, which will follow from Theorem 4.7
and the following lemma.

Lemma 4.8. Let $T$ be an unrooted binary tree with $n \ge 3$ leaves. Then there exists an
internal vertex $v$ in $T$ inducing a tripartition $A|B|C$ such that two of the three
components contain at least $\lceil n/4 \rceil$ leaves of $T$.

Proof. According to Exercise 1.5 in [19], every tree has a centroid $v$, which is an internal
node such that each component of $T \setminus v$ has at most $|V|/2$ vertices, where $|V|$ is
the number of vertices of $T$. This same statement holds if we replace $|V|$ with $n$ and
vertices with leaves in the definition of the centroid. Since the tree $T$ is binary and $v$
is an internal vertex, there are three components of $T \setminus v$. The largest component
has at least $\lceil n/3 \rceil$ leaves and at most $\lfloor n/2 \rfloor$. Thus there are at
least $\lceil n/2 \rceil$ leaves remaining between the other two components, which implies
that in the most balanced case, one of the other two components has at least
$\lceil n/4 \rceil$ leaves. Since $\lceil n/3 \rceil \ge \lceil n/4 \rceil$, this proves the
claim. □

Simple examples show the bound $\lceil n/4 \rceil$ in this lemma is the best possible.
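
The lemma can be checked directly on small trees. The sketch below is an illustration with hypothetical helper names and two hand-built 8-leaf trees: for each internal vertex it counts the leaves in the three components obtained by deleting that vertex, and reports the tripartition whose second-largest component is largest. On the balanced tree the second-largest component has exactly $\lceil 8/4 \rceil = 2$ leaves, so no vertex does better than the bound; on the caterpillar it has 3.

```python
from math import ceil

def from_edges(edges):
    """Adjacency lists of an unrooted tree given as a list of edges."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, []).append(w)
        adj.setdefault(w, []).append(u)
    return adj

def component_leaf_counts(adj, v):
    """Leaf counts of the components of T minus v, sorted in decreasing order."""
    counts = []
    for start in adj[v]:
        seen, stack, leaves = {v, start}, [start], 0
        while stack:
            u = stack.pop()
            if len(adj[u]) == 1:
                leaves += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        counts.append(leaves)
    return sorted(counts, reverse=True)

def best_tripartition(adj):
    """Tripartition at an internal vertex maximizing the second-largest part."""
    internal = [v for v in adj if len(adj[v]) == 3]
    return max((component_leaf_counts(adj, v) for v in internal), key=lambda c: c[1])

# Balanced 8-leaf tree: four cherries x1..x4, joined in pairs at u and w.
balanced = from_edges([("x1", "a"), ("x1", "b"), ("x2", "c"), ("x2", "d"),
                       ("x3", "e"), ("x3", "f"), ("x4", "g"), ("x4", "h"),
                       ("u", "x1"), ("u", "x2"), ("w", "x3"), ("w", "x4"), ("u", "w")])
# Caterpillar with 8 leaves along the spine s1..s6.
caterpillar = from_edges([("s1", "l1"), ("s1", "l2"), ("s1", "s2"), ("s2", "l3"),
                          ("s2", "s3"), ("s3", "l4"), ("s3", "s4"), ("s4", "l5"),
                          ("s4", "s5"), ("s5", "l6"), ("s5", "s6"), ("s6", "l7"),
                          ("s6", "l8")])

print(best_tripartition(balanced), best_tripartition(caterpillar), ceil(8 / 4))
```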

Proof of Theorem 1.1. According to Lemma 4.8, there is an internal vertex of $T$ inducing
a tripartition $A|B|C$ such that $\#A \ge \#B \ge \lceil n/4 \rceil$ and $\#C \ge 1$. Thus
$\mathbf{T} = (T, \dots, T)$ is in the $\lceil n/4 \rceil$-deep class. Theorem 4.7 then
applies. □
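
Combining Lemma 4.8 with the hypothesis $r < k^{j-1}$ of Theorem 4.7 at $j = \lceil n/4 \rceil$ gives an explicit count of mixture components covered by this argument. The helper below is a small sketch of that arithmetic; the function name is an assumption.

```python
def max_components(n, k):
    """Largest r with r < k^(ceil(n/4) - 1), for an n-leaf tree and k states."""
    j = -(-n // 4)  # integer ceiling of n / 4
    return k ** (j - 1) - 1

# For example, 12 leaves with k = 4 states gives j = 3 and r < 16.
print(max_components(12, 4), max_components(20, 4), max_components(8, 2))
```

The count grows exponentially in the number of leaves, which is the sense in which the result covers many mixture components on large trees.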

5. Further Directions 

The techniques employed in this paper have been primarily concerned with, and are 
effective for, the identification of parameters in mixture models where the underlying trees 
share large common substructures. Establishing identifiability of either numerical or tree 
parameters in situations where there is no commonality between the trees remains an 
open problem. 

Even in the case of general Markov mixtures of two 4-leaf trees little is understood:
First, in the case of two different tree topologies being mixed, it is unknown if the tree
parameters are generically identifiable. Second, if the two trees are given, it is unknown if
numerical parameters are generically identifiable. These problems might be addressed by
finding stronger versions of the tensor rank results we have employed (e.g., a strengthened
version of Kruskal's theorem). But it also seems likely that a solution to these
problems will require the development of new mathematical techniques.